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Abstract 

QCMPI is a quantum computer (QC) simulation package written in Fortran 90 with 
parallel processing capabilities. It is an accessible research tool that permits rapid 
evaluation of quantum algorithms for a large number of qubits and for various 
"noise" scenarios. The prime motivation for developing QCMPI is to facilitate nu- 
merical examination of not only how QC algorithms work, but also to include noise, 
decoherence, and attenuation effects and to evaluate the efficacy of error correc- 
tion schemes. The present work builds on an earlier Mathematica code QDENSITY, 
which is mainly a pedagogic tool. In that earlier work, although the density matrix 
formulation was featured, the description using state vectors was also provided. In 
QCMPI , the stress is on state vectors, in order to employ a large number of qubits. 
The parallel processing feature is implemented by using the Message-Passing Inter- 
face (MPI) protocol. A description of how to spread the wave function components 
over many processors is provided, along with how to efficiently describe the action 
of general one- and two-qubit operators on these state vectors. These operators in- 
clude the standard Pauli, Hadamard, CNOT and CPHASE gates and also Quantum 
Fourier transformation. These operators make up the actions needed in QC. Codes 
for Grover's search and Shor's factoring algorithms are provided as examples. A 
major feature of this work is that concurrent versions of the algorithms can be eval- 
uated with each version subject to alternate noise effects, which corresponds to the 
idea of solving a stochastic Schrodinger equation. The density matrix for the ensem- 
ble of such noise cases is constructed using parallel distribution methods to evaluate 
its eigenvalues and associated entropy. Potential applications of this powerful tool 
include studies of the stability and correction of QC processes using Hamiltonian 
based dynamics. 
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Program Summary 



Title of program: QCMPI 
Catalogue identifier: 



Program summary ?7i2L.' http://cpc.cs.qub.ac.uk/summaries 



Program available from: CPC Program Library, Queen's University of Belfast, 
N. Ireland 

Operating systems: Any system that supports Fortran 90 and MPI; developed 
and tested at the Pittsburgh Supercomputer Center, at the Barcelona Super- 
computer (BSC/CNS) and on multi-processor Macs and PCs. For cases where 
distributed density matrix evaluation is invoked, the BLACS and SCALA- 
PACK packages are needed. 

Programming language used: Fortran 90 and MPI 

Number of bytes in distributed program, including test code and documenta- 
tion: 

Distribution format: tar.gz 

Nature of Problem: Analysis of quantum computation algorithms and the ef- 
fects of noise. 

Method of Solution: A Fortran 90/MPI package is provided that contains mod- 
ular commands to create and analyze quantum circuits. Shor's factorization 
and Grover's search algorithms are explained in detail. Procedures for dis- 
tributing state vector amplitudes over processors and for solving concurrent 
(multiverse) cases with noise effects are implemented. Density matrix and en- 
tropy evaluations are provided in both single and parallel versions. 
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1 INTRODUCTION 



Achieving a realistic Quantum Computer (QC) [T]|2] requires the control, mea- 
surement, and stability of simple quantum systems called qubits. A qubit is 
any system with two accessible states which can form a quantum ensemble. 
That ensemble can be manipulated to store and process information. Since 
quantum states can exist as superpositions of many possibilities, and since an 
isolated quantum system propagates without loss of quantum phases, a QC 
provides the advantage of being a "massively parallel" device and having en- 
hanced probability for solving difficult, otherwise intractable, problems. That 
enhancement is generated by constructive quantum interference. This ideal 
situation can be disrupted by external effects, which can cause the quantum 
system to loose its quantum interference capabilities-this is called decoherence 
and loss of entanglement. In addition, uncontrolled random pulses (noise 0) 
could strike the QC during its controlled performance and thereby its opera- 
tions or gates can be less than perfect. 



To gauge the efficacy of a QC, even when influenced by such external envi- 
ronmental effects, and to evaluate the positive influence of error correction [3] 
steps, it is important to have large scale QC simulations. Such simulations can 
only represent a small part of the full "massively parallel" quantum ensemble 
dynamics, since a real QC goes way beyond the capabilities of any classical 
computer. Nevertheless, it seems natural to invoke the best, most parallel and 
largest memory computers we have available. Therefore, we embarked on de- 
veloping a Fortran 90 parallel computer QC simulation, starting with the basic 
QC algorithms of quantum searching [5] and factorization [3]. Other authors 
have also attacked this problem to good effect [6H71l8]|9][T0] . Nevertheless, there 
is a need for a generally available, well-documented, and easy to use super- 
computer version, to encourage others to contribute their own advances. In 
addition, we have developed a broader range of applications and supercom- 
puter techniques than previously available. An important feature of our work 
is that we invoke the algorithms on concurrent groups of processors, which 
are then subject to different noise. Then, the overall density matrix is con- 
structed as an ensemble average over these noise groups. The density matrix 
can be stored on a grid of processors and its eigenvalues found using parallel 
codes, thereby avoiding the pitfalls of overly large matrix storage. Thus, we 
can evaluate the entropy, and indeed sub-entropies, for the dynamic evolution 
of a QC process in a simulation of a real world environment. 



There are various types of noise, such as thermal noise. We use the term noise in 
a generic sense, although specific noise models can be incorporated into QCMPI . 
^ Teleportation and superdense coding programs are also available, but were omit- 
ted for brevity. 
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Our code is called QCMPI to indicate that it is a QC simulation based on 
the Message-Passing Interface (MPI) [Il|[T2]. It is a Fortran 90 simulation of 
a Quantum Computer that is both flexible and an improvement over earlier 
such works [ 6.7.8;9..10j . The flexibility is generated by a modular approach to 
all of the initializations, operators, gates, and measurements, which then can 
be readily used to describe the basic QC Teleportation [13], Superdense cod- 
ing [H], Grover's search |5] and Shor's factoring [1] algorithms. We also adopt 
a state vector rather than a density matrix [15], approach to facilitate rep- 
resenting a large number of qubits in a manner that allows for general treat- 
ments, such as handling the dynamics stipulated by realistic Hamiltonians. We 
include environmental effects by introducing random stochastic interactions in 
separate groups of processors, that we dub multiverses. 

In section 121 we introduce qubit state vectors along with various state vector 
notations. We stress that a wave function component description allows for 
changes induced by simple one-body operators such as local quantum gates 
and also one-body parts of Hamiltonians. Examples are provided in section [3] 
of the affect of a general one-body operator on both two and more qubit sys- 
tems. Expansion in a computational basis, using equivalent decimal and binary 
labels, is used to demonstrate the role of operators on the state vector ampli- 
tudes. It is shown how to distribute a wave function over numerous processors 
and how to handle the fact that a one-body operator acts on wave function 
amplitudes in an manner that not only modifies amplitudes stored on a given 
processor, but also affects amplitudes seated on other processors. Criteria for 
locating the processors involved in these classes of operators are derived. Un- 
derstanding this combination of effects; namely, wave function distribution 
and the alteration of that distribution due to the action of a one-body oper- 
ator, is central to all subsequent developments. It is handled by careful MPI 
invocations and serves as a model for the extension to multi-qubit operations. 

In section m the MPI manipulations described earlier for the one qubit case 
are generalized and then the layout for the two-qubit operator alterations of 
the quantum amplitudes are clarified. With that result in hand, the particular 
two-qubit gates CNOT and CPHASE are readily constructed, as are two- 
body Hamiltonians for dynamical applications. Generalization to three-qubit 
operators, in particular to the ToffoUi gate, are obvious. 

In section Ism Grover's algorithm is discussed and it is shown how QCMPI allows 
one to simulate up to 30 qubits, (depending on the number of processors and 
available memory) in a reasonable time. 

Shor's algorithm is simulated using QCMPI as discussed in section ISTTl Several 



A state vector requires arrays of size 2"^, whereas a density matrix has a much 
larger size 2"« x 2"'''. Here Uq denotes the number of qubits. 
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standard codes that handle large-number modular and continued fraction ma- 
nipulations are provided, but the heart of this case is the Quantum Fourier 
Transform (QFT) and an associated projective measurement. The QFT is 
generated by a chain of Hadamards and CNOT gates acting on a multi-qubit 
register. It is shown how to do a QFT with wave function components dis- 
tributed over many processors. Here the benefit of using MPI is dramatic. 

In section ini the procedures invoked to describe parallel universes, subject 
to stochastic noise, is explained for both the Grover and Shor algorithms. 
For brevity, similar application to teleportation and superdense coding are 
omitted here (although also implemented using QCMPi). Also in section[6], the 
construction and evaluation of a density matrix is discussed in two ways. 
In one way, the full density matrix is stored on the master processor and 
its eigenvalues and the associated entropy is evaluated using a linear code 
subroutine. In the second, more general way, the density matrix is spread over 
many processors on a BLACS constructed processor grid and eigenvalues and 
entropy determined using the parallel library SCALAPACK [16]. The later 
version reduces the storage needs and enhances speed. 

A brief description of the included routines is given in section [71 and finally 
some conclusions and future developments are discussed in section [HI 



2 STATES 



2.1 One-Qubit States 

The state of a quantum system is described by a wave function which in gen- 
eral depends on the space or momentum coordinates of the particles and on 
time. In Dirac's representation-independent notation, the state of a system is 
a vector in an abstract Hilbert space | "^(t)), which depends on time, but in 
that form one makes no choice between the coordinate or momentum space 
representation. The transformation between the space and momentum repre- 
sentation is contained in a transformation bracket. The two representations are 
thus related by Fourier transformation, which is the way Quantum Mechanics 
builds localized wave packets. In this manner, uncertainty principle limita- 
tions on our ability to measure coordinates and momenta simultaneously with 
arbitrary precision are embedded into Quantum Mechanics (QM). This fact 
leads to operators, commutators, expectation values and, in the special cases 
when a physical attribute can be precisely determined, eigenvalue equations 
with Hermitian operators. That is the content of many quantum texts. Our 
purpose is now to see how to define a state vector, to describe systems or 
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ensembles of qubits as needed for quantum computing. Thus, the degrees of 
freedom associated with change in location are suppressed and the focus is on 
the two-state aspect. 

Spin, which is the most basic example of two-valued quantum attribute, is 
missing from a spatial description. This subtle degree of freedom, whose ex- 
istence is deduced, inter alia, by analysis of the Stern-Gerlach experiment, is 
an additional Hilbert space vector feature. For example, for a single spin 1/2 
system the wave function including both space and spin aspects is: 

*(fi,t) I s m,), (1) 

where | s m^) denotes a state that is simultaneously an eigenstate of the 
particle's total spin operator s"^ = si + Sy + si, and of its spin component 
operator s^. That is 

I STUs) = h'^s{s + 1) I snis) Sz I STUs) = HtUs I snis) . (2) 

For a spin 1/2 system, we denote the spin up state as | snis) |, |) =| 0), 
and the spin down state as | srris) — >| |, — |) =| 1). 

A simpler, equivalent representation IS cLS cL two component amplitude 




This matrix representation can be used to describe the two states of any 
quantum system and is not restricted to the spin attribute. In this matrix 
representation, the Pauli matrices a are: 



0". 



(4) 



1 \ 1 M f 

CTx > \ , (Ty > \ 

-Ij \1 } \i 

These are all Hermitian matrices cxj = aj. Along with the unit operator X = cxo 

/ 



J = (To 



1 
1 



(5) 



any operator acting on a qubit can be expressed as a combination of Pauli 
operators. 
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Operators on multi-qubit states can be expressed as linear combinations of 
the tensor product Ll| of the Pauh operators. For example, a general operator 
fl can be expressed as 

3 3 

^ = I] • • • H Ai,i2-in, Wh fT,„J, (6) 

n=0 inq=0 

where Ai,i2---jnq is in general a set of complex numbers, but are real numbers 
for hermitian Q. 

A one qubit state is a superposition of the two states associated with the above 
and 1 bits: 

|vl/) = Co|0) + Ci|l), (7) 

where Co = (0 | and Ci = (1 | are complex probability amplitudes for 
finding the particle with spin up or spin down, respectively. The normalization 
of the state | ^) = 1, yields | Cq P + | Ci P= 1. Note that the spatial 
aspects of the wave function are being suppressed; which corresponds to the 
particles being in a fixed location, such as at quantum dots[£]. A 2 x 1 matrix 
representation of this one-qubit state is thus: 




An essential point is that a QM system can exist in a superposition of these 
two bits; hence, the state is called a quantum-bit or "qubit." Although our 
discussion uses the notation of a system with spin 1/2, it should be noted 
again that the same discussion applies to any two distinct states that can be 
associated with | 0) and | 1). 

2.2 Multi-qubit States 

A quantum computer involves more than one qubit; therefore, we generalize 
the previous section to multi-qubit states. 

For more than one qubit, a so-called computational basis of states is defined 



^ A tensor product of two matrices B is defined by the rule: {qi,qj \ A i? | 
Qs,qt) = {qi \ ^ \ qs){qj I B \ qt), with obvious generalization to multi-qubit 
operators. 

® When these separated systems interact, one might need to restore the spatial 
aspects of the full wave function. 



8 



by a product space 

I =1 gi)--- 1 gn,) =1 Q), (9) 

where denotes the total number of qubits in the system. We use the conven- 
tion that the most significant qubit is labeled as qi and the least significant 
qubit by g^^. Note we use qi to indicate the quantum number of the ith qubit. 
The values assumed by any qubit are limited to either = or 1. The state la- 
bel Q denotes the qubit array Q = {qi,q2,--- , ^n,) , which is a binary number 
label for the state with equivalent decimal label n. This decimal multi-qubit 
state label is related to the equivalent binary label by 

riq 

n = q,- 2"'-i + q2 ■ 2^"-' + . . . + . 2° = " 2"'"* . (10) 

i=l 

Note that the ith qubit contributes a value of ■ 2"9~* to the decimal number 
n. Later we will consider "partner states" (| Hq), \ fii)) associated with a given 
n, where a particular qubit ig has a value of g^^ = 0, 

no = n - g,^ ■ 2"^-^% (11) 

or a value of g^^ = 1, 

^1 = « - (gi. - 1) • 2"^"'^ (12) 

These partner states are involved in the action of a single operator acting on 
qubit is, as described in the next section. 

A general state with Ug qubits can be expanded in terms of the above compu- 
tational basis states as follows 

l^)n, = E^QlQ)= E Cn In), (13) 

Q n=0 

where the sum over Q is really a product of Ug summations of the form J2q,=o,i ■ 
The above Hilbert space expression maps over to an array, or column vector, 
of length 2"" 



Co 



or with binary labels 



^0-00 



Cq..., 



01 



• (14) 



\Ci...uJ 



^ An important aspect of relating the individual qubit state to a binary represen- 
tation is that one can maintain the order of the qubits, since if a qubit hops over to 
another order the decimal number is altered. 
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The expansion coefficients C„ (or Cq) are complex numbers with the physical 
meaning that C„ = {n \ is the probability amplitude for finding the 

system in the computational basis state | n), which corresponds to having the 
qubits pointing in the directions specified by the binary array Q. 

Switching between decimal n and equivalent binary Q labels is accomplished 
by the simple subroutines bin2dec and dec2bin in QCMPI . There we denote the 
binary number by an array B{1) ■ ■ -Binq). The routines are: 



call bintodec(nq,B,D) 



call dectobin(nq,D,B) 



where nq is the number of qubits; D is a real decimal number and B is the 
equivalent binary array. 



3 ONE-QUBIT OPERATORS 



An important part of quantum computation is the act of rotating a qubit. The 
NOT and single qubit Hadamard H. operators are of particular interest: 



NOT 



/o 1^ 



1 



H 



^/2 



1 

71 



1 1 
1 -1 



• (15) 



These have the following effect on the basis states NOT | 0) =| 1), NOT | 
1)= |0),and7i|0) = M),7^|l) = M). 

General one-qubit operators can be constructed from the Pauli operators; we 
denote the general one-qubit operator acting on qubit s as Q^. Consider the 
action of such an operator on the multi-qubit state | : 



Q 

= E ••• E ••• E Ui)--- i^s\qs)) 

91=0,1 93=0,1 =0,1 



(16) 



(17) 



Here VLs is assumed to act only on the qubit is of value g^. The ils\ Qs) term 
can be expressed as 

^s\qs)= E \Q's)Ws\^s\qs), (18) 

9^=0,1 
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using the closure property of the one qubit states. Thus Eq. (1T7I) becomes 



^.|*)n,=ECQf^s|Q)= (19) 
Q 

■■■ ■■■ Y II C'q Ws\^s\qs) I gi) ■ ■ ■ \ q's)---\Qn,)- 

51=0,1 -?s=0,l qnq=0,l q'^=0,l 

Now we can interchange the labels Qs ^ q's, and use the label Q to obtain the 
algebraic result for the action of a one-qubit operator on a multi-qubit state 

I ^)n, = EC'q I Q) = E C'n I n), (20) 
Q n=0 

where 

CQ=Cn= Y{(ls\^s\q's) Cq,, (21) 
9^=0,1 

where Q = (gi, g2, ■ " " ^n,) , and Q' = (qi, ■ • • • • • gn,) • That is Q and Q' are 
equal except for the qubit acted upon by the one-body operator Qg- 

A better way to state the above result is to consider Eq. (1211) for the case that 
n has Qs = and thus n ^ Hq and to write out the sum over q'^ to get 

Cno = (0 I fi. I 0)Cno + {0\n,\ l)Cn„ (22) 

where we introduced the partner to riQ namely rii. For the case that n has 
Qs = 1 and thus n —>■ rii Eq. fl2Tl) . with expansion of the sum over q'^ yields 

Cn, = (1 \ns\0)Cn, + {l l)Cn, , (23) 

or, written as a matrix equation, we have for each no, rii partner pair 

/ 



Cno 
Cm 



(0 1 a 1 0) (0 1 a I 1) 
\^(l|^], |o) (l|^^. |i) 




(24) 



This is not an unexpected result. Later we will denote the matrix element 
(0 I fis I 0) as VLsqq, etc. 

Equation (12^ above shows how a 2 x 2 one-qubit operator VLs acting on qubit 
is changes the state amplitude for each value of no. Here, no denotes a decimal 
number for a computational basis state with qubit ig having the value zero 
and ni denotes its partner decimal number for a computational basis state 
with qubit ig having the g^ value one. They are related by 

nl = no + 2"«-^^ (25) 
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At times, we shall call 2"''"*'' the "stride" of the ig qubit; it is the step in n 
needed to get to a partner. There are 2""'/2 values of Uq and hence 2"'' /2 pairs 
no,ni. Equation fl24p is applied to each of these pairs. In QCMPI that process 
is included in the subroutine OneOpA. 

Note that we have replaced the full 2""' x 2""' one qubit operator by a series 
of 2"9/2 sparse matrices. Thus we do not have to store the full 2'""' x 2""^ but 
simply provide a 2 x 2 matrix for repeated use. Each application of the 2x2 
matrix involves distinct amplitude partners and therefore the set of 2 x 2 
operations can occur simultaneously and hence in parallel. 

In the next section, the procedure for distributing the state vector over several 
processors is described along with the changes induced by the action of a one- 
body operator. Later this procedure is generalized to multi-qubit operators, 
using the same concepts. 

3. 1 Distribution of the State 

The state of the multi-qubit system is described at any given time by the array 
of coefficients Cnit) for n = 0, • ■ ■ 2"'' — 1, see Eq. (IT^ . The action of a one- 
qubit gate, assumed to act instantaneously, is specified by the rules discussed in 
the previous section. Now we wish to distribute these state-vector coefficients 
stored in "standard order" with increasing ra, over a number of processors 
Np. For convenience, we assume that the number of processors invoked is a 
power of two, i.e. Np = 2^ and thus we can distribute the C„ coefficients 
uniformly over those processors with A^^ = 2""'/Np = 2'^'"^ amplitudes on 
each processor. In the code we denote A^^^ as NPART. So, for example, we 
place 

Cq - ■ ■ Cn^-i on processor myid = 0; (26) 

Cn^ ■ ■ ■ C2N^-i on processor myid = 1; 

C{Np-i)N^ ■ ■ ■ CnpN^-1 on processor myid = Np - 1 . 

Where myid is the processor number, from to Ap — 1. Note that Np ■ N^ = 
2""^. This distribution of the state over the Np processors places a demand 
of 2"«~^ on the memory of each processor. For 64 processors p = 6 and the 
memory required is down by a factor of 64; and for 4096 processors p = 12 
and the memory required is down by a factor of 4096, etc. As the number of 
processors available increases, so will the memory demands on each processor. 

However, life is not that simple. A one-qubit operator for a given partner pair 
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no,ni often involves a C„(, that is seated on one processor and a that 
is seated on another processor. We need to deal with that situation, while 
respecting the scheme for standard order distribution of the amplitude coef- 
ficients. The first question that arises is when are the pairs C^no 7 seated 
on the same processor? We call that being "seated in the same section," in 
analogy to theater seating. That is, we dub being located on a particular pro- 
cessor as having the same section, with the location of a particular amplitude 
within that section being called its "seat." With that language, it is simple 
to state the condition for an amplitude pair C„o,C„i being on the same pro- 
cessor; namely, that the difference (we call this the "stride") tiq — rii = 2"«~*'' 
be less than the distance 2""3/Np = 2"''"^ or simply is > p. If this condition 
is not satisfied, the stride is large enough to jump out of the section and thus 
require inter-processor communications. This result holds true if the number 
of processors is of the form Np = 2^ = 1, 2, 4, 8, 16 • • • . One can prove this 
rule by induction. 

This condition ig > p, indicates that the larger ig, that is for qubits that 
are the least significant contributors to the state label n, the associated pairs 
of amplitudes reside on the same processor. In contrast, the smaller ig are 
the qubits for which the pair amplitudes are the furthest away in processor 
number. The stride ranges from a value of 1 for ig = Uq (least significant qubit) 
to 2"«/2 for ig = 1 { most significant qubit). Carrying out the 2x2 matrix 
multiplication Eq. fl2^ is simple for those pairs on the same processor, but 
suitable transfer to the requisite processors must be implemented before one 
can perform the requisite 2x2 matrix multiplication. To carry out that process 
requires a way to identify the processor (e.g. the section assignment) and the 
location within that processor (the seat ) and to interchange the amplitudes. 
The latter task is carried out using the MPI protocol, as discussed later. 

3.2 Pair Section, Seat and MPI 

Distribution of the 2"« amplitudes Cq - ■ ■ C2"9_i over the Np processors, places 
iVj. = 2'^'i/Np = 2"«^^ amplitudes on each processor. As the state label n 
ranges from to 2"« — 1 one jumps between different processors. The relation- 
ship between the n label and the processor on which the associated amplitudes 
sits is simply: section = lnt{n/Nx), where Int() means the integer part and the 
seat (i.e. location within that processor) is seat = Mod(?T,, A^^.) which denotes 
modular arithmetic of base N^. In the code is called NPART and section 
is identical with myid, the processor number. 

With the ability to identify the processor/location or section/seat assignment 
associated with each index n, the remaining task is to transfer the requisite 
amplitudes to the "correct" location. That task is carried out by the Message 
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Passing Interface( MPI ) commands MPI_SEND and MPI_RECV. We need 
to coordinate the various processors and exchange data during a calculation. 
The main reason MPI was developed over the last several decades is to enable 
efficient communication between processors during a computation. 

Why use MPI? The MPI protocol affords many advantages for developing 
parallel processing codes. The main advantages are that: (1) MPI provides 
a standard set of routines that are easy to use and (2) MPI is flexible and 
works on many platforms. Thus MPI proved perfect for our need to develop 
a user-friendly, flexible realization of the action of multi-qubit operators on 
state vectors in a parallel computing environment. 



3.3 Action of One-Qubit Operator 

The following figures (Figs. [T]- [2]) illustrate the role played by MPI in trans- 
ferring distributed amplitudes to appropriate processor locations when the 
one-qubit operator acts. We use the case of = 3 or 2^ = 8 components as a 
simple example. 

The first case takes the partner labels n = 1,3, which corresponds to the binary 
numbers (001) and (Oil). Here we use the binary labels for the components 
and consider the special case of a one-qubit operator acting on qubit 2 and 
assuming two processors p = 1 (see Fig. [1]). For that case, the two amplitudes 
affected by the one-qubit operator reside on the same processor, i.e., they have 
just different seats in the same section. Thus there is no need for MPI data 
transfer. 

Now consider the partner labels n = 0,4, which correspond to the binary 
numbers (000) and (100). Again we use the binary labels for the components, 
but now consider the special case of a one-qubit operator acting on qubit 1 
and again assuming two processors. For this case, the two amplitudes affected 
by the one-qubit operator do not reside on the same processor, i.e., they are in 
different sections. Thus there is now an essential need for MPI data transfer, 
which involves sending and receiving as illustrated in Fig. [2l This entails two 
sends and two receives. 

Of course, one needs to continue this process for the other three amplitude 
pairs = 1, 5, n = 2, 6, and n = 3,7. In general, we have 2""'/2 partner pairs. 
Those pairs require six more sends and six more receives. An important issue 
is then to see if the time gained by invoking more processors wins out over 



We have run our codes on the Pittsburgh and Barcelona supercomputers, and 
also on arrays of imacs. 
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Processor 1: 



Processor 2: 



Processor 1: 



x: 



Cooo = Cm 
Coio = Cqio 

Con = ^sioCooi + ^siiCoii 



Processor 2: 

Cioo = Cioo 
Cioi = Cioi 
Cm = Clio 
Cm = Cm 



Fig. 1. A three qubit state vector is acted on by a one-qubit operator on qubit 2 
{is = 2) . The case illustrated is for the partners n = 1,3, which correspond to the 

binary numbers (001) and (Oil). It is assumed that there are just two processors 
Np = 2^, with p = 1. Thus is > p for this case and the two coupled amplitudes 
reside on the same processor and no MPI data transfer is invoked. 




Processor 1: 



■■ Cooo = ^sOoCooo + ^sOlCioo 
Cool = Cool 
Colo = Coio 
Con = Con 

Processor 2: 

Cioo = ^^.sioCqoo + f^snCioo 
Cioi = Cioi 
Cue = Clio 
Cm = Cm 



Fig. 2. A three qubit state vector is acted on by a one-qubit operator on qubit 1 

{is = 1) • The case illustrated is for the partners n = 0, 4, which corresponds to the 
binary numbers (000) and (100). It is assumed that there are just two processors 
Np = 2P, with p = 1. Thus the condition ig > p is not satisfied and indeed the two 
coupled amplitudes reside on different processors and MPI data transfer is invoked. 
We need to send (Si) component Coco to processor one, and it is received at (i?i), 
and also send {S2) component Cioo to processor zero, and it is received at (-^2)- 
Later we will specify the send and receive commands in the MPI language. 

the time needed for all of these MPI transfers. Another important concept is 
one of "balance," which involves the extent to which the various processors 
perform equally in time and storage (ideally we assume they are all exactly 
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equivalent and in balance). 



It is important to understand the above illustrations, because for more qubits 
and more processors and for two- and three- qubit operators, the steps are 
simply generalizations of these basic cases. Careful manipulation of the am- 
plitudes, allows for spreading the amplitudes over many processors and using 
MPI to do the requisite data transfers for all kinds of operators and gates. 

For the one qubit case, the steps here are called by the command 



call OneOpA(nq,is,Op,psi,NPART, COMM) 



where nq denotes the number of qubits; "is" labels the qubit acted on by 
operator "Op," psi is an input wave function array of length NPART=A^^., 
which is returned as the modified state vector. The last entry COMM is 
included to allow for later extension to separate communication channels that 
we refer to as parallel universes or multi verses. 

Let us emphasize that any operator, acting on one qubit is a special case of the 
one described here. Thus all rotations, all so-called local operations, including 
those generated by the one-body part of Hamiltonian evolution, are covered 
by the code OneOpA. 



4 MULTI-QUBIT OPERATORS 



Let us return to the main issue of how to distribute the amplitudes over several 
processors and to cope with the action of operators on a quantum state. The 
case of a two-qubit operator is a generalization of the steps discussed for a 
one-qubit operator. Nevertheless, it is worthwhile to present those details, as 
a guide to those who plan to use and perhaps extend QCMPI . 

We now consider a general two-qubit operator that we assume acts on qubits 
is-^ and is2, each of which ranges over the full 1, ■ ■ ■ .Ug possible qubits. General 
two-qubit operators can be constructed from tensor products of two Pauli 
operators; we denote the general two-qubit operator as V. Consider the action 
of such an operator on the multi-qubit state | \E')„g : 
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V|^)n, = E^Q VI Q) (27) 
Q 

= E--- E ■■■ E C'q |gi)---(V|g.ig.2)) ■■■|g„,). 

gi=0 g3i,(?s2=0 gn„=0 



Here V is assumed to act only on the two isi,is2 qubits. The (V| g^i Qs2)) term 
can be expressed as 

1 

V| g.i = E I ^si gs2)(g^i I V| gsi (28) 
9^1. 9^2=0 

using the closure property of the two-qubit product states. Thus Eq. ( |28!) 
becomes 



v|^)n,=E^QV|Q)= E--- E ■■■ E ■■■ E E (29) 

Q (31=0 <?sl=0 932=0 <J„q=0 g^^,<j^2=0 

Cq {qsiq's2 I V I qsiqsi) \ qi) ■ ■ ■ \ q'sM " " " Un,)- 

Now we can interchange the labels g^i ^ q'si,qs2 ^ q's2 and use the label 
Q to obtain the algebraic result for the action of a two-qubit operator on a 
multi-qubit state 



2"9-l 

V|^k=EC'Q |Q)= E Cn \ n), (30) 

Q n=0 

where 

1 

CQ = Cn= E {qsiqs2 I I q'siq's2) Cq', (31) 

where Q = (gi, g2, • " • , and Q' = (gi, ■ ■ ■ q'si ■ ■ ■ q's2- ' ' qng) ■ That is Q and 
Q' are equal except for the qubits acted upon by the two-body operator V. 

A better way to state the above result is to consider Eq. (13T|) for the following 
four choices 



'^oo- 


Hqi- 


■■qsi 


= 0- 


■■qs2 


= 0,- 


■■gnj 


noi - 


Hqi- 


■■qsi 


= O-' 


■■qs2 


= 1,- 


■■qn,) 


nw- 


Hqi- 


■■qsi 


= I-' 


■■qs2 


= 0,- 


■■qn,) 


nii - 


Hqi- 


■ - qsi 


= I-' 


■■qs2 


= 1,- 


■■qn,), 



(32) 

where the computational basis state label Uq^^^q^^ denotes the four decimal 
numbers corresponding to Q = (gi, ■ ■ ■ g^i ■ ■ ■ gs2 ■ ■ ■ gn„)- 
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Evaluating Eq. (13T!) for the four choices Eq. (132!) and completing the sums 
over q'g^, g^25 the effect of a general two-qubit operator on a multi-qubit state 
amplitudes is given by a 4 x 4 matix 



(c \ 




VoO;00 


VoO;01 


VoO;10 


VoO;ll^ 




Ic \ 






Voi;00 


Voi;01 


Vol; 10 


Voi;ll 




r 






VlO;00 


VlO;01 


VlO;10 


VlO;ll 










\ Vll;00 


Vll;01 


Vll;10 


Vll;ll j 




\C'nii / 


= {hi 


1 V 


1 Kl). 


Equation fl33l) 


above shows how 



(33) 



where Vij-u 

qubit operator V acting on qubits %s\, is2 changes the state amplitude for each 
value of noo- Here, noo denotes a decimal number for a computational basis 
state with qubits isi,is2 both having the values zero and its three partner 
decimal numbers for a computational basis state with qubits igi, is2 having the 
values (0, 1), (1,0) and (1, 1), respectively. The four partners noo, "n-oi, ""-lo, ''^ii, 
or "amplitude quartet," coupled by the two-qubit operator are related by: 

noi = noo+2"'-'»^ n^o = noo+2"'-'=^ = r^oo+2"'-'=^+2"«-'=^ (34) 

where is2-,is2 label the qubits that are acted on by the two-qubit operator. 

There are 2"«/4 values of noo and hence 2"9/4 amplitude quartets noo, ""-oi, '^lo, '^ii- 
Equation fl33|) is applied to each of these quartets for a given pair of struck 
qubits. In QCMPI that process is included in the subroutine TwoOPA. 

In this treatment, we are essentially replacing a large sparse matrix, by a 2"''/4 
set of 4 X 4 matrix actions, thereby saving the storage of that large matrix. 

In the next section, the procedure for distributing the state vector over several 
processors is illustrated along with the changes induced by the action of a two- 
body operator. 



4-1 Action of Two- Qubit Operators 



To visualize the distribution of the amplitudes over several processors and the 
role played by MPI in transferring the amplitudes to appropriate location, 
when the two-qubit operator acts, the following diagrams lay out the scheme. 
We again use the case of n^ = 3 or 2^^ = 8 components as a simple illustration. 

The first case takes the amplitude quartet labels n = 0,1,2,3 which corre- 
sponds to the binary numbers (000), (001), (010), and (Oil). Here we use the 
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Processor 1: 




Processor 1: 

" Cqoo 
" Cooi 
- Coio 



c. 



nil 



Processor 2: 



Processor 2: 




C'loo 


= Cioo 


C'loi 


= Cini 


Clio 


= Clin 


Cm 


= Cm 



Fig. 3. A three qubit state vector is acted on by a two-qubit operator on qubits 2 and 
3 {isi = 2,is2 = 3) . The case illustrated is for the amplitude quartet n = 0, 1,2,3, 
which corresponds to the binary numbers (000), (001), (010), and (Oil). It is as- 
sumed that there are just two processors Np = 2^, with p = 1. Thus is2 > isi > P 
for this case and the two coupled amplitudes reside on the same processor and no 
MPI data transfer is invoked. The dashed circles indicate that all four amplitudes 
contribute to forming the values of Cqooj C'ooIj C'oiOj Cqu are given by Eq. (f3T|) . 



binary labels for the components and consider the special case of a two-qubit 
operator acting on qubits 2 and 3. We consider just two processors. In this 
case the four amplitudes affected by the two-qubit operator reside on the same 
processor, i.e., they have just different seats in the same section. Thus there 
is no need for MPI data transfer. 



Now consider the amplitude quartet labels n = 0,2,4,6, which corresponds 
to the binary numbers (000), (010), (100), and (110). Again we use the binary 
labels for the components, but now consider the special case of a two-qubit 
operator acting on qubits 1 and 2. We consider just two processors. For this 
case, the two amplitudes affected by the one-qubit operator do not reside on 
the same processor, i.e., they are in different sections. Thus there is now an 
essential need for MPI data transfer, which involves sending and receiving as 
illustrated in Fig. HI 

For the two qubit case, the steps here are called by the command 



call TwoOpA(nq,isl,is2,Op,psi,NPART,COMM) 



where nq denotes the number of qubits; isl,is2 label the qubits acted on by 
operator Op, psi is an input wave function array of length NPART=A^2;, 
which is returned as the modified state vector. 
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Processor 1: 




Processor 1: 

Cool 
Con 



Processor 2: 




Processor 2: 



ClOO 

Cioi = Cioi 
Clio 

Cm = Cm 



Fig. 4. A three qubit state vector is acted on by a two-qubit operator on qubits 1 and 
2 {ia\ = l,is2 = 2) . The case ihustrated is for the amphtude quartet n = 0,2,4,6, 
which corresponds to the binary numbers (000), (010), (110), and (110). It is as- 
sumed that there are just two processors Np = 2^, with p = 1. Thus is2 > p, but we 
do not have igi > p, thus for this case amplitudes reside on different processors and 
MPI data transfer is invoked. The dashed circles indicate that all four amplitudes 
are to be sent/received from other locations. 

4.2 CNOT and CPHASE 



The two-qubit operators CNOT and CPHASE are oft-used special cases of 
the above two-qubit operator discussion. They are simpler than the general 
case because they are given by the sparse matrices 



V ^ CNOT = 



1 

10 

1 

yO 1 



V ^ CPHASE 



1 

1 

1 








-1 



CNOT stores the rule 00 00, 01 



(35) 

10, where qubit 1 is 



01,10^ 11,11 

the control, and qubit 2 gets acted on by fJi only when qubit 1 has a value of 
one. In QCMPI , a subroutine CNOT codes this special two-qubit operator: 



call CNOT(nq,isl,is2,psi,NPART,COMM) 



CPHASE stores the rule 00 ^ 00, 01 01, 10 ^ 10, 11 ^ -11, where qubit 
1 is the control, and qubit 2 gets acted on by only when qubit 1 has a 
value of one. In QCMPI , a subroutine CPHASEA codes this special two-qubit 
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operator: 



call CPHASEA(nq,isl,is2,psi,NPART,COMM) 



Another two-qubit operator which plays a key role in the quantum Fourier 
transformation, is the CPHASEK, given by a sparse matrix that depends on 
a positive integer k 



V CPHASEK 



1 ^ 
10 

10 







(36) 



In QCMPI , a subroutine CPHASEK codes this special two-qubit operator: 



call CPHASEK (nq,isl,is2,k, psi,NPART,COMM) 



Note that there are no MPI commands required in this subroutine. 



4-3 The Full Hadamard-Special Handling 



An important example of a multi-qubit operation is when Hadamards act on 
all of the qubits in a system-a step that is often used to initialize a QC. One 
way to do that is simply to repeat the prior discussion and use the subroutine 
for qubits is — l---ng. That procedure is implemented in the subroutine 
HALL. 

Hadamards acting on all qubits involves the operator 



ni®n2---®nn, . (37) 



Another way to implement this operator is based on the realization that when 
acting on the column vector (Cq • • • C2"9_i) it forms an equal weighted com- 
bination with particular signs s„.n', whereby the effect of is 

2"'2-l 

Cn — ^ —n^ ^ Sn,n'Cn' = C'n ■ (38) 
2 2 n'=0 

The task is to determine the signs s„,„'. These signs are relatively simple to pin 
down. From the product structure of it is simple to show that the signs 
are determined by the condition Sn,n' = {~^Y, where 5 denotes the number 
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of times the binary bits for n, n' of unit value are equal, i.e. how many times 
B{i) = B'{i) = 1. This condition is carried out in the function SH: 



FUNCTION SH(nq,n,np) 



where nq is the number of qubits; np = n'. This procedure is implemented 
in the subroutine HALL2. The user should decide which version HALL or 
HALL2 works best in their context. 

Ironically, although a small subroutine, HALL or HALL2 is used repeatedly 
in Grover's search and dominate the time expended in a large qubit quantum 
search. We shall refer to the operator Ti.^'' as HALL throughout this paper, 
recognizing that it can be implemented using either CALL HALL or by the 
special sign handling method CALL HALL2. 



5 A SAMPLE OF RELEVANT QUANTUM ALGORITHMS 

QCMPI permits the simulation of any quantum circuit on a parallel computing 
environment. In this section we describe two well-known QC algorithms al- 
ready included in the current package and which exemplify the use of QCMPI in 
practical applications. 



5.1 GROVER's searching algorithm 

We now show how to apply the operators (gates) , and the treatment of 
a multi-qubit state, to the first of several basic QC algorithms. These are 
standard procedures in QC and we examine them with QCMPI so that one can 
describe these algorithms dynamically using basic, realistic Hamiltonians and 
also subject these procedures to environmental effects. The case of superdense 
coding, [13] which is a way to enhance communication between Alice and Bob 
by means of shared entangled states., has also been developed in QCMPI. 

Our first application presented here is Grover's search algorithm [5j. In this 
case, we start with a state of qubits all pointing up | 000 ■ ■ ■ 0) , and act on 
it with HALL, see Eq. (jSZD- Then, we need an all-knowing Oracle operator 
Oracle to mark an item that is to be ferreted out by the algorithm. The Oracle 
step is very simple when we use amplitude coefficients C{n); we simply find 
the processor (section) and location on that processor (seat) associated with 
the marked item and reverse the sign of that amplitude C{nx) — > —C^n^). 
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All other amplitudes are unchanged. 



Ic, \ (c„ \ 



Oracle 



(39) 



This process can be extended to two or more marked items. 

Grover's procedure, which entails using constructive interference to make the 
marked item's amplitude stand out from all others, involves acting repeatedly 
on the state HALL| 000 • • • 0) with the "Grover operator" 



^? = HALL ■ I\Q ■ HALL • Oracle 



(40) 



where X\C is an operator 



i\n^2 \ ooo---o)(ooo---o I -I 



(41) 



where I is an 2"'^ x 2"'^ unit matrix. The operator X\C is simply realized by 
changing the sign of all amplitudes except for the n — one. 



Co 

C2"9-l 



-C2 
—Co'" 



(42) 



The combination HALL ■ X\C ■ HALL is called an inversion about the mean 
and is an essential part of Grover's algorithm. The other essential part is to 
act on the initial state enough times rit to produce an amplitude C(nx) that 
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stands out with high probabihty. We take the number rif = ^v2^ 








C 



1 



G"* • HALL I 000 ■ ■ • 0) 



(43) 



C 



In all of these simple states, it is the HALL operator and its repeated appli- 
cation in G"* that involves the most time expenditure. 

The QCMPI Grover search for a large number of qubits is included as Guniv.f90. 
Instructions for running the code and a guide to steps invoked are incorporated 
directly as comments in the listing. Generalization to include noise using a 
multiuniverse approach is discussed later. 

5.2 SHOR's factoring algorithm 

Shor's algorithm [1] is a QC method for factoring a large number. The basic 
idea is to prepare a state that, when subjected to a Quantum Fourier Trans- 
form (QFT), permits one to search for the period of a function that reveals 
the requisite factors with high probability. It uses quantum enhancement to 
go way beyond classical factoring procedures and yields the factors with high 
probability for very large non-prime numbers, after relatively few tries. To sim- 
ulate this algorithm, where we are restricted to numbers much smaller than 
possible in a future QC, there are several steps implemented in the QCMPI 
code subshor.fQO. A pedagogic analysis of the reasons for each step of Shor's 
algorithm is presented in reference . 

Step 1: Choose the number, M, and set the register sizes 

Choose the number, M, to be factorable number: 15, 21, 33, 35, 39, 55, 77 ■■ ■ 
and determine the size of two requisite registers. 

Preparatory tests on the input number M are made so that the code continues 
only if the input number is not a power of 2 or a prime and is thus a suitable 
candidate for factoring. These are classical procedures performed by standard 
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codes. Then, based on having an acceptable input number, an initial state of 
two distinct registers are prepared, with register one having rii and register 
two having n2 qubits. The first register should [HIT^ have enough qubits to 
store in base 2 all numbers in the range to 2M^, i.e. > 2"! < 2M^. 
Therefore the choice for rii is set as 

ni = Ceiling(log2(g) ) Q = 2C-ii'^s(i°S2(*^') ) (44) 

where Ceiling(x) gives the smallest integer greater than or equal to x.0 

The number of qubits in register two is set by 

ri2 = Ceiling (log2(M) ) (45) 

so that there are enough qubits to store in base 2 all numbers up to and 
including the value of the input number M. 

Here we invoke the minimum number of qubits for both registers. A larger ni 
lengthens the computation, albeit providing higher probability of success. 



Step 2 : Load the first register 

Load the first register with all the integers less than or equal to 2"^^ — 1. 

This is achieved by acting with a Hadamard on all qubits in register one, that 
is use HALL on a basis state of ni qubits so that register one is set to the 

state 

2"i-l 

I ^i) = HALL I 00---0)„, = — (46) 

Thus each of the 2"^ amplitudes appear with equal weight, which is the quan- 
tum massive parallel processing feature. 

Next we attend to setting register two. 



^ For example, if M = 21, and = 441, then ni = 9 and 2^ = 512, and register 
one includes the value 441. If ni were taken as 8, then register one would not include 
the value 441, since 2^ = 256. If ni were taken as 10, then register one would exceed 
the value 2M'^ = 882, since 2^° = 1024. 

For example, if M = 21, then n2 = 5 and 2^ = 32 > M, and register two includes 
the value 21. If n2 were taken as 4, then register two would not include the value 
21, since 2^ = 16. 
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Step 3 : Load the second register 

Select an integer (xguess) coprime to M and load the function /(j) = xguess-' {modM) 
into the second register for a fixed choice of xguess and all possible j values: 
< j < 2"! - 1. 

Note that in QCMPI, we simply compute | f{n)) and tack it on to the | n)^^ 
state. As discussed by Shor [4], one must actually do this crucial step by a 
quantum process for modular exponentiation. 

In Shor's algorithm, xguess is a random choice for a number that is coprime 
to M, where coprime means that M and xguess have no common factor other 
than 1. Euler's phi function of M is used to determine the range of integers 
between 1 and M that are coprime to M; a number in this interval a is selected 
and then tested using GCD[a, M] = 1 to assure that it is coprime to M. If it 
passes that test we set the value for a — > xguess. 

The reason for defining the above function (aka the Shor Oracle) is that this 
function /(j)has a characteristic period for each value of M and xguess. 

I f[n + r)) =1 /(n)), f{n) = xguess" (modM). (47) 

Finding the period r is a key goal. 

The full state composed of registers 1 and 2 {ug = rii + 722) is built in the 
following way: 

I ^)". = 7;!ir \ I /H)n2 • (48) 

2 2 n=0 

Step 4 : Measure register 2 

We could measure the second register next or postpone that act to coincide 
with step 6 below, because step 5 involves register 1 only. It is helpful to think 
of the action of measuring register 2 now to motivate the need for step 6. 
For each possible value of ni k, one asks if register 2 is in state n2{k \ by 
projecting the full state 

^ 2"i-l D-l 

nik I ^)n, = -IT II I n)nAk \ firi))^^ 7T II I ^fc + J>)ni, (49) 
2 2 n=0 j=0 

where at the last stage the state is normalized after projection-the usual Born 
rule for a projective measurement. Note that for every choice oi k,D terms of 
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the first register appear in superposition where D is ~ TP'^ jr. The integer D 
is ascertained by the period r for a fixed xguess by 

I /K + (Z) - l)r)) =1 /K + P - 2)r)) = ■ ■ ■ =1 /K + 10) =1 /K)) = ^• 

(50) 

The next step involves acting on register 1 to search for the period r by 
enhancing the quantum interference using a quantum Fourier transform. 



Step 5 : Quantum Fourier Transform register 1 



Register one, which is now in the state 

1 D-i 

I $)ni = ^ E I nk + 3r)m, (51) 

is next acted upon by a quantum Fourier transformation operator QFT (see 
the Appendix for a discussion of the QFT operator) which changes the above 
state 



1 D-l 2"i-l 

QFT I = ^=L= E e^™(-+^-^^)/^- I 

V2 ^L> j=Q „ = Q 

-1 2"i-l D-l 

= ^== E e^™"^/^"^^ (e2--/2"^)^-|n)„,. (52) 

The QFT is a unitary operator which switches to a basis in which the super- 
position is isolated into the above exponential amplitude. The sum on j can 
be performed!^ and thus the result is 

QFT I = -jL= e2™"*/2"^ e™(D-i)/2"i MDnn^) 

VWD sin(7™2^) 

(53) 

The probability for finding the final state with register 1 in state {n \ and 
register 2 in state {k \ is therefore 



p{n^ k) 



2"iL' 



sin(D7m2^) 



n 2 



(54) 



The value of D is constrained by the conditions < + (-D — l)r < 2"^ — 1 and 
< f^fc < r — 1. Hence, the integer D is constrained by -D < + . 

The following summation rule is used here ~ ^x-i • ^'^^ 

display this as 2-D vector additions of equal length phasors, as in Fresnel zone plate 
interference. 
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where the dependence on k has dropped out, (except for a possible dependence 
of D on k-see earher footnote). Note that for the special case that is 
an integer, the above result reduces to p{n, k) ^ ^ i, which is understood 
by returning to Eq. (jS^D and using Ylf=o i^Y = D. 

How can one extract the period r from making such a measurement on registers 
1 and 2? 



Step 6: Measure register 1 and determine period and factors 



The measurement of registers 1 and 2 has a probability given by Eq. flMj) . At 
select values of n — > ri, the probability p(n, k) has local maxima. Consider the 
associated fraction ^rrf , which is extracted from a determination of those local 
maxima. At these maxima 



sm{D7ir^] 
sin(7rr2Sr) 



(55) 



In the arguments of the sin functions in Eq. fl55]) , D is an integer, so the 
maximum probability occurs in the vicinity of an integer value for r^. We 
therefore seek an approximate value of the ratio ^ ~ integer /r, for an even r. 
That ratio is found by expressing ^ as a continued fraction and determining 
its first convergent of the form integer /r, for an even rllf. 



That determines the value of the period r, which we require to be even so that 
we can use the final step[H 

/i = GCD[xguess"/2 + 1, M]; /a = GCD[xguess'^/2 _ (gg) 

to determine the factors /i, /2 of M. The above process simplifies if the ratio 
^ = D is already an integer. 

In QCMPI the local probability maxima and the associated factors are all stip- 
ulated. In an actual measurement, one of those results would be found with 
that probability. 



Gerjuoy |19j showed that the maximum probability is not less than ~ 0.4, but 
more likely to be > ~ 0.81 

Note that the periodic function xguess''Mod[M] = xguess'^Mod[M] = 1. For even 
period r this yields (xguessa)^ — 1 = Mod[M] = (xguesss — l)(xguess2 + 1). As 
long as xguess2 is not one, at least one of (xguess2)^ibl must have a common factor 
with M, and therefore finding GCD[(xguess2 )^ it 1] yields the factors of M. 
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6 PARALLEL UNIVERSE AND NOISE 



A real quantum computer will involve the manipulation of qubits using ex- 
ternal fields and interactions with single qubits and between qubits. Clearly, 
each physical realization has its set of Hamiltonians that describe that system 
and these QC manipulations. The circuit description of QC involves gates, 
which in turn should be described by the action of Hamitonians on qubits. 
For example, the simple one-qubit Hadamard gate can be realized by rotating 
the qubit's spin axis from the z to the x axis by means of a —/I - B interaction 
acting for the proper time. More complicated gates involve clever design of 
one and two-qubit interactions. In the future, we hope that QCMPI will pro- 
vide a tool for describing all requisite gates based on Hamiltonian evolution. 
Dynamical evolution involves one- ( i^i) and two- ( H'i) body Hamiltonians 
I = [1 — |(ifi-|-i?2)'5t] I ^ (t) ). Their action over a small time interval 

bt can be calculated by repeated application of the OneOpA and TwoOpA 
codes provided in QCMPI . Such applications are the subject for future studies. 

The major obstacle to the implementation of such gates required for the suc- 
cess of QC algorithms is the strong possibility that random intrusions, such as 
noise, will decohere the quantum system and remove the essential feature of 
quantum interference. That issue behoves us to simulate the affect of noise by 
considering many replications of the QC algorithm, which ideally are identical, 
and then subject each of them to random single and double one- qubit as well 
as single two-qubit errors. For that task MPI is ideally suited and therefore, 
as a major part of this paper, we have implemented that "Parallel Universe" 
approach, for which we include herein the Grover and Shor algorithms. Other 
cases (teleportation and superdense coding) have also been implemented. Sub- 
sequent numerical studies of the efficacy of error correction protocols can be 
implemented using the framework provided by the parallel universe feature of 
QCMPI. 

The next feature of this "Parallel Universe" approach is that all of the state 
vector amplitudes can be gathered together and used to construct an ensemble 
average in the form of a density matrix. This process corresponds to solving 
a set of stochastic Schrodinger equations [2U] and using those solutions to 
produce a density matrix. Let us now examine the steps needed to construct 
a density matrix. 

6.1 Density Matrix 

There are advantages to using a density matrix to describe QC dynamics. The 
density matrix describes an ensemble average of quantum systems, with its 
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evolution determined not only by the system's Hamiltonian but also by envi- 
ronmental terms using either Kraus operator [T7| or Lindblad [18] differential 
equation forms In addition, the description of entanglement and of mixed 
states is handled nicely and concepts like entropy and Fidelity can be evalu- 
ated more readily. To form a density matrix in QCMPI and to determine the 
entropy, affords a good example of how to extend QCMPI to such ensemble 
averages. 



For a definite state vector, the pure state density matrixtlH is simply 



2"q-l 2"i-l 

p =1 \= E C*{n')C{n) I | . (57) 

n=0 n'=0 

This large (2"'« x 2"'«) matrix can be distributed over Np = T' processors by 



placing 2"'«~^ x 2"'"^ matrices on each processor.!^ Matrix multiplication, 
traces and eigenvalue determination can then be implemented using MPI pro- 
cedures, supplemented by BLACS processor grid and parallel linear algebra 
SCALAPACK programs [16]. Once the eigenvalues of p are calculated the en- 
tropy can be determined. But for a pure state, we know that p^ = p, and since 
the trace of p is one, the eigenvalues for a pure state are 1 and 2"'' — 1 zeros. 
Thus the entropy is zero, as it should be for a well-defined, non-chaotic, albeit 
probabilistic state. 

How do we go beyond a pure state density matrix within the QCMPI setup? 
There are several options, but one overall goal. The overall goal is to build a 
state I \E'q,) repeatedly as labeled by a, with an associated probability Va with 
J2a "Pa = 1- For each case, the state | \E'q) could be generated in a different way. 
One option is to get a set of amplitudes Ca{n) randomly, with each random set 
assigned a probability Va- Another way is to select a few qubits and subject 
them to random one and two body interactions and possible stochastic pulses 
(noise), again assigning each case a probability Va- The associated mixed state 
density matrix would then be 

2"<7-l 2"9-l 

p = ^PJ ^a){'^a 1= E E E^« C:{n')Ca{n) \ u) {u' \ . (58) 

a n=0 n'=0 a 



The above result can be expressed as 
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(n 



p\n')=J2'Pc.Ca{n)C:{n'). (59) 



The density matrix is Hermitian, has unit trace, and is positive definite. In general 
< p, with the equal sign applied for pure states. 

To facilitate the parallel treatment of the density matrix, we take p as even. 
-"^"^ An abbreviated version is p = Ylia^a I Q')(ct 1; with Ca{n) = {n\ a). 



30 



This is perhaps not the most general density matrix, but one can trace out 
some of the ancilla qubits and/or subject the density matrix to additional en- 
tangling operations using p' = UpW or even apply the non-unitary Lindblad 
[12] process to generate an enhanced range of density matrices. These proce- 
dures, which we outline here, are included in this version of QCMPI to facilitate 
studies of decoherence and environmental effects. A major advantage of QCMPI 
is that the invocation of parallel universes (aka multiverses) to describe the 
influence of noise on a QC does not involve much increase in computation 
time compared to a single pure run, especially since the only communication 
between groups is that used is to construct the density matrix. This scheme 
provides an efficient use of multi-processor computers. 



6.2 Parallel universe implementation 

The above steps are implemented in QCMPI by first splitting the overall num- 
ber of processors iVp (nprocU) into many groups No, each group is referred 
to as a "multiverse." For convenience, we take both Np and Ng to be pow- 
ers of 2. Within each multiverse, there are Np/Nc = Ng (nprocM) proces- 
sors that are used to perform a distinct QC algorithm. The MPI command 
MPI_COMM_SPLIT is used to produce these separate groups. Each group 
is specified by its group rank (rankM), which ranges from zero to NGROUPS- 
1, where NGROUPS denoted the total number of multiverses. l£. 

The method used to store and evaluate the density matrix is controlled by 
an integer lentropy. For the choice lentropy=0, there is no evaluation of 
the density matrix. For the choice Ientropy=l, the full density matrix is 
constructed on the master processor and its eigenvalues determined by a LA- 
PACK code. That procedure should be used when storage space for p is ample. 
For Ientropy=2 the density matrix is not stored on one processor, but is dis- 
tributed on a BLACS generated processor grid and the parallel eigenvalue code 
PCHEEVX from the SCALAPACK package is invoked to evaluate p's eigen- 
values. To carry out this last task the number of processors, groups and qubits 
have to be carefully monitored for consistency with the codes conventions, as 
indicated directly in the listings. 



° There are spawning features of MPI-2 that might be invoked to carry out this 
process more efficiently, but at this stage we found MPI-1 sufficient for our needs. 
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6.3 Noise scenarios 



A simple example of a "noise scenario" has been included I ^^lin QCMPI to show 
how the role of noise can be examined. The motivation here is to first introduce 
noise and later to evaluate various error correction schemes. 

The division of a large number of processors into groups was made so that 
only the first group (rankM=0) functions without noise. All of the other groups 
perform the algorithm with noise. The noise is introduced separately for each 
group (or multiverse) where the users can design their own scenarios. We 
have input noise using a one-qubit unitary operator (subroutine D2) that we 
take as a 2 X 2 Wigner rotation function 1^2 (a, /3, 7), where 7 are three 



Euler angles, [f^ This can be specialized to either small deviations or, within 
a phase, to one of the Pauli operators. One can introduce one qubit noise, 
acting on a random qubit, typically once within each processor multiverse, 
but two or more one-qubit noise intrusions can be invoked at various stages 
of the algorithm, by suitable placement of the subroutine "Noise." 

In addition, a two-qubit unitary operator (subroutine D4) that we take as 
a 4 X 4 Wigner function T'2(a,/3, 7), can also be specialized to either small 
deviations or, within a phase, to one of the Pauli operator products ctj ® Cj. 
This allows one to introduce a single error that acts on two qubits once, in 
contrast to two one-qubit errors. 

The one-qubit operator is assumed to act on a random selected qubit (qhit) 
and at selected, variable stages of the algorithm (eloc). Extension to two-qubit 
noise is obvious. Of course the associated universes which allow two one-qubit 
or single two-qubit errors should carry lower weight. 

By using unitary operators in each universe the overall density matrix still 
maintains unit trace, but of course the trace of will be decreased by noise. 
The probability of success will also decrease. 

Thus, QCMPI provides a framework for introducing errors and, along with 
Hamiltonian-driven gates, provides an important tool for dynamical studies 
of QC with noise and in the future with error correction. 



See subroutine Noise called in subgrover.fQO and subshor.f90 
We take random a, /3, and set 7 = for simplicity. 
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7 FORTRAN AND MPI CODES 



Sample QCMPI codes are provided which incorporate directions as to how to 
run the code. From these examples, the user should be able to see the benefit 
of being able to handle problems with a considerable number of qubits, orga- 
nized into parallel universes, in reasonable time. Some improvements could be 
invoked to accelerate QCMPI , for example by collecting messages and sending 
them as a group (collective communications). The issue here is the standard 
fight between sharing the work load over the available processors (balance) 
and minimizing the cost of sending messages. However, the major benefit of 
dividing a large number of processors into multiverses and subjecting each 
one to separate noise scenarios, is in itself justification and reason to use a 
multi-processor supercomputer. 

The list of files contained in QCMPI are: 

• qcmpisubs.f90, contains all QCMPI subroutines 

• Guniv.fQO, builds multiverse environment for Grover's search 

• subgrover.f90, Grover's search routine 

• Suniv.fQO, builds multiverse setup for Shor's factoring algorithm 

• subshor.fQO, Shor's factoring routine 

• makefile, sample of compiling options for several super computing facilities 

• *.job, sample job submitting scripts 

• README, instructions. 

7. 1 Performance 

As an indication of the performance of QCMPI , we have run a number of sample 
cases using the multiverse Grover codes included in the package. In tabled! we 
show the global memory requirements together with the wallclock time and 
the percentage of the latter used in MPI operations. 



33 



Table 1 

Performance of a number of sample runs all realized using Guniv.f, subgrover.fOO and 
qcmpisubs.f90. In all cases, the Ientropy=2 option is chosen and thus the entropy is 
computed making use of the scalapack routines and two multiverses are considered. 
The Gflop/sec and Gbytes refer to the total amount used by all processors. The 
information presented has been obtained using IPM [22j . 



NP 


nq 


2nq 


Gflop/sec 


Gbytes 


Wallclock (sec) 


% communication 


4 


10 


1024 


1.99618 


1.4043 


1.25 


26.63 


4 


12 


4096 


3.31855 


9.96069 


48.98 


5.65 


16 


10 


1024 


1.90923 


5.54911 


1.07 


62.27 


16 


12 


4096 


10.4583 


38.7235 


11.29 


39.14 


64 


10 


1024 


0.65133 


22.1393 


3.21 


31.19 


64 


12 


4096 


15.2444 


153.814 


7.54 


56.32 



8 CONCLUSIONS AND FUTURE DEVELOPMENTS 



In conclusion, the Fortran 90 code QCMPI provides a modular approach to 
quantum algorithms that provides accessible implementation of quantum com- 
putation algorithms. All of the gates needed for the circuit model are provided, 
as well as the quantum Fourier transformation procedure. Extension to three- 
qubit operators and to the one-way model of computation are straightfoward, 
as is the extension to the qutrit case. Such extensions will be provided in the 
future by the authors and hopefully also by interested users. 

The main features of QCMPI are the distribution of state-vector amplitudes 
over processors, to allow for increased number of qubits and the use of MPI 
to carry out the requisite communication needed when one- and two- body 
operators (gates) act on states. This task is carried out in a manner that allows 
ready extension to Hamiltonian driven QC dynamics. 

In addition. QCMPI provides a multi-universe setup, which replicates the QC 
algorithm over many groups, at little cost in computation time. That pro- 
cedure provides a major advantage of QCMPI , not generally available in the 
literature, to provide a framework for studying the role of noise on the efficacy 
of QC. That is, we believe, the major task in this subject. 

The methods demonstrated here for the distribution and evaluation of a large 
density matrix can be generalized to the case of large unitary matrices to 
represent gates. 

There is much to do with this tool such as studies of: Hamiltonian driven 
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QC dynamics using realistic Hamiltonians, along with environmental effects, 
inffuence of random pulses, and efficacy of error correction protocols. 
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A The quantum Fourier transform circuit and QCMPI 



The quantum Fourier transform is performed using the circuit in Fig.|Xl A 
ladder of Hadamards and two-qubit control CPHASEK gates (Eq. fl36|) ) are 
used to produce the QFT. 
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Fig. A.l. The quantum Fourier transform circuit. Here (pk = e^'^^l'^^ and the register 
one has n\ qubits.The binary number q\q2 • • ■ Qm corresponds to a decimal number 
n which ranges as < n < 2"^ — 1. The fraction binary notation is used where 

-. To restore the qubits to standard order with 



2s. _L ga + l _i_ 

2 "I" "I" 



+ 



Qb 



2b~a + 



O.qa ■■■qb 

the most significant bit to the left (top of figure) , an addition reversal of the qubit 

^1 

order must be applied. An overall normalization of l/2~ is understood. 

The above steps are carried out in QCMPI with the following code, which in- 
cludes a series of pair swaps to reorder the qubits labels. 



Do ic =l,nl-l 

call OneOpA (nq , ic , had , psi , NPART , COMM) 
Do k=ic+l,nl 

cal 1 CPHASEK (nq,k,ic,k+l-ic,psi, NPART , COMM) 

enddo 

enddo 

! Final Hadamard 

call OneOpA (nq , nl , had , psi , NPART , COMM) 



! Reverse order using pair swaps 
Do i=l,nl/2 

call SWAP (nq, i , nl+l-i , Psi , NPART , COMM) 
enddo 



Here had denotes the Hadamard, psi is the input and then the output state 
vector at each stage, and NPART denotes the part of psi on the current 
processor. The qubits are restored to standard order by a set of pair swaps. 
This also demonstrates how to use the CPHASEK, OneOP(for a Hadamard 
case), and the SWAP subroutine: 

To understand how the ladder of Hadamard and control phase gates yields a 
quantum Fourier transform, note that the final state shown in the code. 
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qiq2 ■ --qm) ^ ^(| 0) + e''^^ | 1)) ® (| 0) + e"'^^ | 1)) 

® • • • (I 0) + e^" I 1)) ^ (I 0) + e^"^ I 1)) , 

(A.l) 



which becomes 



I ?i?2 • ■■qm) ^ ^(1 0) + e''^^ I 1)) (8) (I 0) + e''^' | 1)) 

• • • (I 0) + e^" °-<'2-5ni I ® (I 0) + e^" I 1)) , 

(A.2) 

after the final qubits are reordered by a series of pair swap operations. The 
last result can be written as: 

QFT I qiq2 ■ ■ • Qm) = ^ C^^' [9l0.g„^+q^0.g„^-iq„j-+q;^_l0.q2-?ni+?;^0.qi-?„^] | 

Q' 

(A.3) 

where Q' denotes the binary number q[q2 ■ • ■ corresponding to the decimal 
number n'. The above is equivalent to 

1 N-l 

QFT \n) = ^Yl e'™"'/^ | n'). (A.4) 

n'=Q 

where = 2"^. A simple product nn'/N appears in the exponent because 

[q'lO-qni + gaO-gni-igni h ^ni-lO-^S " " " + ^niO-^l " " " ^ W/A^, which 

can be shown by noting that e^'^^^^ = 1 for all integers s > 0. Therefore, in the 
product nn'/N = (gi2"i-i ^ 522"^-^ • • • qn,2°){q[2''^-^ + g^2"i-2 • • • g;^2°), we 
can drop all cross terms that yield a 2^ which suffices to prove the equivalence. 
Hence, one sees that the QFT is a unitary transformation from basis | n) to 
I n') of the form {n \ QFT | n') = ^ ^^27ri/N-jnn' _ 



B The MPI Codes 



The bintodec, dectobin, OneOpA, TwoOpA, EulerPhi, splitn, ProjA, 
Randx, QFT, CF, SWAP, CPHASEK, HALL, HALL2, Entropy, En- 
tropyP codes are best understood by examination of the explicit directions 
within the code and also by the useage in the sample algorithms. 
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B.l Sample algorithm codes 



Teleportation and Superdense coding codes are also available. In this paper, 
we present the Grover and Shor cases. The Grover code is called Guniv.fQO 
and initiates the process by selecting a marked item that is to be searched 
for, with that item (labelled as IR) distributed to all the processors. There are 
Np = 2P processors that are split into A^^^ = 2^ groups (called multiverses) 
each multiverse then consists of = Np/Nq = 2^"^ members, where both 
Np and Nq are assumed to be powers of 2. Independent searches are carried 
out in each multiverse and at the end (this could be done at any preferred 
stage) the state-vector amplitudes for each group are used to form a group's 
density matrix. An overall ensemble average of all the group's density matrices 
are then computed and either the 2"« x 2"« array is located on the master 
processor of the first group using subroutine Entropy or is distributed over a 
BLASC grid using subroutine EntropyP. 

The first group (rankM=0) is free of noise, whereas all the other groups 
(rankM>0) are subject to various random disturbances with assigned prob- 
ablities. This is where particular noise models could be invoked by the user. 
This structure is also used for the Shor case. 

In the Shor case (Suniv.f90, subshor.f90), there is an initial setup process to 
pick and test the number to be factored that is broadcast to all Np processors, 
with again a split into Nq groups (multiverses) and separate searches done on 
the Ng members of each universe. Again group one is free of noise, whereas 
noise is introduced on all other groups, with a subsequent build up of the full 
density matrix using either the ientropy=l or ientropy=2 options. Others 
cases and extensions all follow this same general pattern. 

One can also examine the particular eigenvalues of the full density matrix, at 
selected stages, and also obtain fidelities and subtraces if desired. 
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